detect: add chelae detect subcommand for adapter identification by tfenne · Pull Request #8 · fulcrumgenomics/chelae

tfenne · 2026-06-26T12:55:46Z

Summary

Adds a second subcommand, chelae detect, that identifies which 3' adapter sequence(s) are present in one or two FASTQ files.

PE mode discovers adapters via R1/R2 overlap (no kit knowledge required), builds a position-by-position consensus across reads in each near-identical k-mer cluster, and reports the consensus extending while per-position coverage stays above an adaptive floor.
SE mode scores each read against every built-in kit adapter plus any user --adapter-sequence / --adapter-fasta candidate via the same scanner trim uses.
3'-end cleanups (poly-G, poly-X, cut-right quality) are applied to every read before the probe, with aggressive defaults reflecting detect's specificity bias (vs trim's sensitivity bias).
Console output is two sections: matched kit(s) with their published adapter sequences, then the full-length discovered consensus with uppercase marking the kit-stable region and lowercase marking any per-sample extension (typically the i7/i5 barcode tail).
Optional FASTA output is cross-sample-portable: when a kit match is detected (within 1 mm over the first 16 bp), the kit's published adapter is emitted in place of the in-sample consensus, so the FASTA round-trips cleanly through chelae trim --adapter-fasta on every sample in a batch.

Also fixes chelae trim --expected-insert-size so the hint actually takes effect on the first pair, and lifts shared FASTQ-reader infrastructure into commands/utils.rs.

Notable design decisions

PE empty adapter-evidence library: detect calls detect_pe_overlap with an empty OverlapAdapterLibrary so probe acceptance discovers novel adapters. Low-complexity inserts (poly-A cfDNA, microsatellites) can produce spurious "discovered" tails, but the kit/library noise floor surfaces in the report rather than hiding the real adapter when present.
Kit-match Hamming threshold = 1 (not 2). A kit match triggers FASTA substitution; 14/16-bp similarity was too loose a bar for that substitution.
u64 counters in TailAccumulator::base_counts (defensive against pathological --num-detections).
Aggressive 3'-end trim defaults in detect (poly-X min run 5, quality trim 4:20) vs trim's conservative ones — detect can afford to scrub more because it's looking for the dominant adapter signal across many reads.

Review history

This branch went through five review rounds (one independent reviewer plus an opus/sonnet/haiku adversarial pass after the major changes). Findings acted on are described in the individual commit messages; findings ruled out (with rationale) are documented inline:

SE FASTA emitting user-supplied --adapter-sequence verbatim (not substituting the kit canonical) — intentional, that's what the user explicitly asked us to score against.
PE insert >= r1_seq.len() || insert >= r2_seq.len() skipping asymmetric-read pairs — matches the docstring contract; sensitivity miss, not correctness bug.
O(N²) aggregation in aggregate_kmers — sub-second at default settings (N ≤ 5000).

Test plan

bash ci/check.sh (fmt + clippy + tests) green on the branch
67 detect tests passing (28 unit + 7 E2E + 32 validation/edge-case)
267 total tests passing
Smoke-tested against synthetic TruSeq PE data with polyG dropoff + Q0 tail
Smoke-tested against synthetic TruSeq PE with a fixed per-sample i7 barcode
Try it on a real FASTQ file from a known kit and confirm the kit match
Try it on a real FASTQ file from an unknown chemistry and confirm a sensible novel-adapter consensus
Round-trip: chelae detect -o adapters.fa → chelae trim --adapter-fasta adapters.fa on the same library, confirm trim metrics look right

`chelae detect` samples reads from one or two FASTQ files and identifies which 3' adapter sequence(s) are present. Two modes: - Paired-end: runs the same overlap probe as `chelae trim` with an empty adapter-evidence library so every probe-accepted shift becomes a candidate detection, harvests the post-template tails on each mate, and builds a position-by-position consensus across reads that landed in the same 16-bp k-mer cluster. The consensus extends as long as per- position coverage stays above an adaptive floor. - Single-end: scores each read against every built-in kit adapter plus any user `--adapter-sequence` / `--adapter-fasta` candidate via the same scanner trim uses. Reports are split into two sections: matched kit(s) with their published sequences, then the full-length discovered consensus with uppercase marking the kit-stable region and lowercase marking the per-sample extension (typically the i7/i5 barcode). Optional `--output-fasta` writes a cross-sample-portable artifact: kit-matched hits emit the kit's published adapter sequence (not the sample-specific consensus), so the FASTA can be fed back into `chelae trim --adapter-fasta` and applied across an entire batch without barcode bleed-through breaking the per-sample match. Shared infrastructure with trim: BUFFER_SIZE and open_fastq_inputs lifted into commands/utils.rs; trim helpers (Adapter, OverlapStats, detect_pe_overlap, find_adapter_3prime, etc.) made pub(crate) for reuse; new load_adapter_fasta_with_names so user-curated FASTA record names survive into the detect output. Defaults are tuned for the discovery use case rather than trim's: --num-detections=5000, --min-detections-for-report=20 floor to refuse reporting on too-small samples, --adapter-min-length=10 (bumped from trim's 6 because detect scores 8+ candidates per read, raising the per-position random-match floor). README updated with examples + options table for the new subcommand.

Three follow-ups from the opus/sonnet/haiku adversarial pass: - Validate `--min-detections-for-report <= --max-reads` (previously validated only against `--num-detections`; a config like `--num-detections 200000 --max-reads 10000 --min-detections-for-report 50000` would pass validation and then guaranteed-fail mid-run). - `KIT_NEAR_MATCH_HAMMING` 2 -> 1. The FASTA writer substitutes the kit's published sequence in for any kit match (exact OR fuzzy), and 2 mm over 16 bp was too loose a bar for that substitution. A novel adapter sharing 14/16 bp with a kit prefix should not be silently relabeled as that kit; real sequencing-error fuzzy matches almost always sit at 1 mm given the per-base error rate. The ALL_KITS members are pairwise far enough that this doesn't cost legitimate matches. - `TailAccumulator::base_counts` widened to `[u64; 5]`. Defensive against pathological `--num-detections` values; theoretical-only at the default 5000. Plus a new substantive feature in response to the same review: apply the 3'-end cleanups from `chelae trim` (poly-G / poly-X / cut-right quality) to each read before it reaches the overlap probe or the candidate scan. detect's specificity bias justifies more aggressive defaults than trim's sensitivity-leaning ones: - `--trim-polyg N` default 10 (matches trim); `0` disables. - `--trim-polyx N` default 5 (down from trim's opt-in default of 10); `0` disables. - `--quality-trim W:Q` default `4:20` (smaller window than trim's `8:20` -> more aggressive); pass `off`/`none` to disable. Cut-right semantics — truncates at the start of the first failing window. Reuses the existing SIMD kernels via two new `pub(crate)` helpers: `find_polyx_tail_len` (already pure) and a freshly-extracted `cut_right_quality_position` (split out of `trim_quality_sliding_5prime` so detect doesn't need an `OwnedRecord`). Two of sonnet's findings were ruled out as non-bugs: the PE `insert >= r1_seq.len() || insert >= r2_seq.len()` skip on asymmetric reads is a sensitivity miss not a correctness bug (and matches the docstring contract); SE FASTA emitting the user's `--adapter-sequence` verbatim instead of substituting the kit canonical is the right behavior because the user explicitly asked us to score against that sequence. README options table updated. Smoke test against synthetic TruSeq + polyG dropoff + Q0 tail confirms the trims fire and the FASTA artifact stays kit-canonical.

- detect.rs: align with CLAUDE.md module-ordering convention. Move `QualityTrimSetting` (a helper enum for `Detect.quality_trim`) below `impl Command for Detect` since "higher-to-lower level" puts the CLI struct first. Move `BASES` from a body-level location up with the other top-of-file constants. Move `AnnotatedHit` up next to the other helper structs (Hit/Candidate/KitMatch/KitMate). Restore a docstring on `push_tail_kmer` that was clipped during a prior reorder. - CHANGELOG: split the previously-stale `[Unreleased]` block into a proper `[0.1.0] - 2026-05-13` entry (the v0.1.0 release), and rewrite `[Unreleased]` as a final-state diff vs v0.1.0 — not a chronological log of intermediate commits. The new entry covers the `chelae detect` subcommand, the `chelae trim --expected-insert-size` fix, and the shared `commands/utils.rs` refactor. No behavior change.

coderabbitai · 2026-06-26T13:09:33Z

📝 Walkthrough

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly matches the main change: adding the chelae detect subcommand for adapter identification.
Description check	✅ Passed	The description accurately summarizes the new detect subcommand, related fixes, and shared helper refactor.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch tf_detect_cmd

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

src/bin/commands/detect.rs (1)
792-815: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Keep TailAccumulator’s impl adjacent to the struct.

base_to_slot splits the struct from its inherent impl. Move the helper below the impl, or make it an associated helper.

As per coding guidelines, command modules require “structs/enums with immediately-following impl blocks.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/bin/commands/detect.rs` around lines 792 - 815, `TailAccumulator` is
separated from its inherent impl by the `base_to_slot` helper, violating the
command-module rule that structs/enums must be immediately followed by their
impl blocks. Move `base_to_slot` below `impl TailAccumulator`, or convert it
into an associated helper on `TailAccumulator`, so the struct and its impl stay
adjacent.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/bin/commands/detect.rs`:
- Around line 862-875: Remove the stale rustdoc about push_tail_kmer from the
effective_trimmed_len documentation block in detect.rs. Keep only the comments
that describe effective_trimmed_len’s 3' cleanup order and returned effective
length, and delete the bucket/k-mer insertion text so the docs match the
function name and behavior.
- Around line 1123-1163: The final human-readable report emitted by
`emit_pe_report` and `emit_se_report` is still going through `info!`, so it is
landing on stderr instead of stdout. Update the report path in these functions
to write explicitly to stdout (for example via a stdout writer/print-style
output), while leaving non-report progress logging on `info!`. Use the existing
symbols `emit_pe_report`, `emit_se_report`, `emit_matched_kits_section`, and
`emit_full_length_rows` to keep the report assembly unchanged apart from the
output target.
- Around line 355-364: The paired-read loop in detect::process_paired_reads
handles R2 EOF as an error but lets R1 EOF break silently, so mismatched inputs
can be accepted. Update the reader1/reader2 matching logic so EOF on reader1
also checks whether reader2 still has data and returns an out-of-sync error when
R2 is longer, mirroring the existing R2 exhausted before R1 path. Use the
reader1.next(), reader2.next(), and process_paired_reads match block to locate
the fix.

In `@src/bin/commands/utils.rs`:
- Around line 17-34: The `open_fastq_inputs` helper is using
`Io::new(paths.len().max(1) as u32, BUFFER_SIZE)` as if the first argument
controlled reader count, but it actually sets the compression level. Update
`open_fastq_inputs` in `utils.rs` to pass a fixed compression level consistent
with the rest of the codebase (as seen in `load_adapter_fasta_with_names`), and
revise the surrounding doc comment to describe the real meaning of
`Io::new(level, buffer_size)` instead of mentioning a thread pool or input
count.

---

Nitpick comments:
In `@src/bin/commands/detect.rs`:
- Around line 792-815: `TailAccumulator` is separated from its inherent impl by
the `base_to_slot` helper, violating the command-module rule that structs/enums
must be immediately followed by their impl blocks. Move `base_to_slot` below
`impl TailAccumulator`, or convert it into an associated helper on
`TailAccumulator`, so the struct and its impl stay adjacent.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5694ef6d-25c7-4fa1-b0dc-d8b0289c434c

📥 Commits

Reviewing files that changed from the base of the PR and between df015cf and 5016330.

📒 Files selected for processing (8)

CHANGELOG.md
CLAUDE.md
README.md
src/bin/commands/detect.rs
src/bin/commands/mod.rs
src/bin/commands/trim.rs
src/bin/commands/utils.rs
src/bin/main.rs

coderabbitai · 2026-06-26T13:09:36Z

+/// Emits the human-readable PE report on stdout via `info!`. Two sections:
+///   1. Matched kit(s): one line per mate per kit match, with the kit's full
+///      published adapter sequence.
+///   2. Full-length discovered consensus: per-mate, the consensus as observed
+///      in the data, with uppercase marking the kit-matched prefix and
+///      lowercase marking the extension past it (typically the i7 barcode).
+fn emit_pe_report(
+    detections: u64,
+    reads_scanned: u64,
+    overlap_hits: u64,
+    r1: &[AnnotatedHit<'_>],
+    r2: &[AnnotatedHit<'_>],
+) {
+    info!("chelae detect complete (paired-end):");
+    info!(
+        "  scanned:    {} pair(s) ({} overlap hits, {} usable detections)",
+        fmt_count(reads_scanned),
+        fmt_count(overlap_hits),
+        fmt_count(detections),
+    );
+    emit_matched_kits_section(r1.iter().chain(r2.iter()));
+    info!("  Full-length discovered consensus (uppercase = matches kit, lowercase = extension):");
+    emit_full_length_rows(r1, detections);
+    emit_full_length_rows(r2, detections);
+}
+
+/// Emits the human-readable SE report on stdout via `info!`. Same two-section
+/// layout as PE, with the section-2 sequences rendered through the same
+/// uppercase/lowercase formatter (no-op for SE because the candidate's
+/// sequence and the matched kit's sequence are typically identical).
+fn emit_se_report(detections: u64, reads_scanned: u64, reported: &[AnnotatedHit<'_>]) {
+    info!("chelae detect complete (single-end):");
+    info!(
+        "  scanned:    {} read(s); {} matched a candidate adapter",
+        fmt_count(reads_scanned),
+        fmt_count(detections),
+    );
+    emit_matched_kits_section(reported.iter());
+    info!("  Candidate adapter(s) above --min-fraction:");
+    emit_full_length_rows(reported, detections);
+}


🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash rg -n 'env_logger::Builder|Target::Stdout|target\(|info!\(|println!|writeln!\(' src/bin/main.rs src/bin/commands/detect.rs

Repository: fulcrumgenomics/chelae

Length of output: 1843

Route the final report to stdout explicitly.
Current info! calls in emit_pe_report and emit_se_report write to stderr due to the env_logger default target. Redirecting stdout misses the report. Use println! or a writeln!(std::io::stdout()) wrapper for the report sections; keep progress logs on info!.

[docs: src/bin/main.rs:49, src/bin/commands/detect.rs:1136-1163]

Current Report Emitter

fn emit_pe_report(...) { info!("chelae detect complete (paired-end):"); // ... }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bin/commands/detect.rs` around lines 1123 - 1163, The final human-readable report emitted by `emit_pe_report` and `emit_se_report` is still going through `info!`, so it is landing on stderr instead of stdout. Update the report path in these functions to write explicitly to stdout (for example via a stdout writer/print-style output), while leaving non-report progress logging on `info!`. Use the existing symbols `emit_pe_report`, `emit_se_report`, `emit_matched_kits_section`, and `emit_full_length_rows` to keep the report assembly unchanged apart from the output target.

- (#1) PE loop: when R1 hits EOF, check whether R2 still has records. If it does, surface an out-of-sync error mirroring the existing "R2 exhausted before R1" path, instead of silently truncating to the shorter file. Added a `pe_r1_short_produces_clear_error` test. - (#3) detect.rs: remove stale `push_tail_kmer` rustdoc that got pasted onto the top of `effective_trimmed_len`'s doc block during the housekeeping pass. - (#4) utils.rs: `Io::new`'s first argument is the gzip compression level, not a thread-pool size. The original trim code threaded `paths.len()` there in the mistaken belief it sized a decompression pool; it doesn't (fgoxide's readers are synchronous and the level is only consulted by writers). Switched to `Io::new(5, BUFFER_SIZE)` matching `chelae trim`'s other use sites, and rewrote the doc comment to describe the real meaning. - (#5) detect.rs: `base_to_slot` now lives inside `impl TailAccumulator`, restoring the "struct immediately followed by its impl block" adjacency rule that the housekeeping pass had broken. Call sites updated to `TailAccumulator::base_to_slot`. Skipped (#2): the `info!`-vs-stdout report-routing finding. The project-wide pattern is that end-of-run summaries go through `info!` (matches `chelae trim`'s summary path); changing detect alone would introduce an inconsistency. Left for a separate decision.

Deep-review pass on the branch surfaced ten fix-in-place items plus one behavioral question about how detect emits novel adapters that have a variable region (e.g. i7 barcode past a kit-stable prefix). Consensus cut (behavioral, was pure majority): TailAccumulator::consensus now tracks a running-min baseline of the per-column majority fraction and stops at (a) coverage < min_coverage, (b) majority < 50%, or (c) a drop of > 10 pp below the baseline. Downstream aggregate_kmers drops any primary whose consensus can't make it to TAIL_KMER_LEN — a cluster that heterogeneous inside the k-mer window shouldn't be reported. Rationale: previously a variable 16 fixed + N variable adapter (imbalanced pool where one index dominates at 25%) would silently emit that plurality base as canonical. Discontinuity cut catches the transition cleanly. Behavioral fixes: * PE / SE --output-fasta now hard-fails when either mate (PE) or the whole set (SE) fails to clear --min-fraction, instead of writing an empty or half-empty FASTA that downstream trim would silently accept. * load_adapter_fasta_with_names now preserves headers with empty bodies (>foo\n>bar\nACGT) and rejects them at validate time, not silently drops them. * --output-fasta paths ending in .gz are transparently gzipped via fgoxide::Io::new_writer instead of writing raw bytes to a .gz extension. * detect --adapter-sequence <SEQ> shorter than --adapter-min-length is rejected in validate() (would have silently never matched). * --adapter-sequence dropped num_args = 1.. so a stray positional after -a is no longer greedily consumed as another adapter value. Doc / API-surface hygiene: * SE-only flag docstrings now say the flag errors on PE input instead of "Ignored on PE" (mismatched validate() behavior). * --quality-trim doc lists 'no' as a disable keyword; README the same. --quality-trim comparison to trim's default is now accurately worded ("trim's opt-in --quality-trim-3p default"). * Zero-detection PE error now names --min-tail-length as a remedy when overlap_hits > 0. * PE min-tail-length comment now describes what the code actually does (both mates required; shared detections denominator makes asymmetric harvesting unsound). * trim.rs items promoted for detect's use tightened back down: load_adapter_fasta reverted to private (only _with_names crosses the module boundary); OverlapAdapterLibrary struct stays pub(crate) but fields and is_empty() go back to private. * main.rs comment no longer claims Trim is the only variant. * utils.rs now imports PathBuf so the fully-qualified signatures match the rest of the file's style. * push_tail_kmer drops the redundant tail.to_ascii_uppercase() (observe already case-folds via base_to_slot). Tests added: * consensus_extends_through_steady_high_majority * consensus_extends_through_gently_declining_majority * consensus_cuts_at_sharp_discontinuity * consensus_cuts_at_absolute_floor_when_pool_dominates * consensus_stops_when_coverage_below_min * aggregate_kmers_drops_primaries_with_short_consensus * se_below_detection_floor_errors_instead_of_reporting * validate_accepts_floor_equal_to_num_detections * validate_accepts_floor_equal_to_max_reads * validate_rejects_user_candidate_shorter_than_adapter_min_length * pe_output_fasta_errors_when_min_fraction_excludes_all_hits * se_output_fasta_errors_when_min_fraction_excludes_all_hits * load_adapter_fasta_errors_on_header_without_body Verification: bash ci/check.sh (fmt + clippy + 276 tests) green; smoke test on a real PE library (SRR873822) still identifies TruSeq exact match on both mates with kit-canonical sequences in the FASTA and gzip-by-extension working.

tfenne · 2026-07-02T18:09:18Z

@coderabbitai review

coderabbitai · 2026-07-02T18:09:25Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

tfenne added 3 commits June 26, 2026 05:10

coderabbitai Bot requested changes Jun 26, 2026

View reviewed changes

tfenne added 2 commits June 26, 2026 07:38

tfenne merged commit c69b494 into main Jul 2, 2026
5 checks passed

tfenne deleted the tf_detect_cmd branch July 2, 2026 18:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

detect: add chelae detect subcommand for adapter identification#8

detect: add chelae detect subcommand for adapter identification#8
tfenne merged 5 commits into
mainfrom
tf_detect_cmd

tfenne commented Jun 26, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

Uh oh!

tfenne commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tfenne commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notable design decisions

Review history

Test plan

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tfenne commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tfenne commented Jun 26, 2026 •

edited

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading